Support Transcriber session with Google realtime API #1321

Merged: 31 commits merged into main from gemini-realtime-stt on Jan 20, 2025

Conversation

jayeshp19 (Collaborator)

No description provided.

changeset-bot (bot) commented on Jan 2, 2025

🦋 Changeset detected

Latest commit: dc5c258

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
| Name | Type |
| --- | --- |
| livekit-plugins-google | Patch |


jayeshp19 force-pushed the gemini-realtime-stt branch from 6407a82 to 663f44f on January 10, 2025 20:43
jayeshp19 force-pushed the gemini-realtime-stt branch from 663f44f to aee4c1c on January 10, 2025 21:03
Review thread on examples/multimodal-agent/gemini_agent.py (resolved):
```python
@self._session.on("agent_speech_completed")
def _agent_speech_completed():
    self._update_state("listening")
    if self._playing_handle is not None and not self._playing_handle.done():
        self._playing_handle.interrupt()
```
Member:
can you include comments on why this is needed?

```python
def _agent_speech_completed():
    self._update_state("listening")
    if self._playing_handle is not None and not self._playing_handle.done():
        self._playing_handle.interrupt()
```
Member:

Why should we interrupt here?

jayeshp19 (Collaborator, Author):

Because we call this function when speech is interrupted as well. Gemini likely made some changes: it now returns server.turn_complete instead of server.interrupted when the agent is interrupted, which is confusing. In both cases we end up calling this function, so we interrupt the playing handle here.
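A commented version of the snippet, as the reviewer requested, might look like this (a sketch that restates the explanation above, not necessarily the exact merged code):

```python
@self._session.on("agent_speech_completed")
def _agent_speech_completed():
    # Gemini now emits server.turn_complete both when the agent finishes
    # speaking naturally and when the user interrupts it, so this single
    # handler covers both cases.
    self._update_state("listening")
    # If audio is still being played back, this "completion" was actually
    # an interruption: stop the in-flight playback.
    if self._playing_handle is not None and not self._playing_handle.done():
        self._playing_handle.interrupt()
```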

```python
from typing import Any, Dict, List, Literal, Sequence, Union

from livekit.agents import llm

from google.genai import types  # type: ignore
```
theomonnom (Member) commented Jan 13, 2025:

The code here is hard to follow; it's really sad we don't have types (it's unclear what the structure of the dicts is).
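One way to make the dict structure explicit is TypedDict; a sketch of the pattern follows (the field names here are hypothetical illustrations, not the actual google.genai payload shapes):

```python
from typing import Literal, TypedDict

class FunctionCallDict(TypedDict):
    # Hypothetical: a tool/function call extracted from a model response.
    name: str
    args: dict[str, object]

class ContentPartDict(TypedDict, total=False):
    # Hypothetical: one part of a content entry; all fields optional.
    text: str
    function_call: FunctionCallDict

class ContentDict(TypedDict):
    # Hypothetical: a single conversation turn as the realtime API sees it.
    role: Literal["user", "model"]
    parts: list[ContentPartDict]
```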

```python
self._transcriber.on("input_speech_done", self._on_input_speech_done)
self._agent_transcriber.on("input_speech_done", self._on_agent_speech_done)
# init dummy task
self._init_sync_task = asyncio.create_task(asyncio.sleep(0))
```
Member:
This isn't really doing anything?

theomonnom (Member):

Where do we make sure the transcribed user speech is inside the chat_ctx and always before the generated agent speech?

jayeshp19 (Collaborator, Author):

> This isn't really doing anything?

Yes, it is being called directly from the base class. We need to keep the dummy task unless we wrap it with capabilities.support_truncate in the base class.
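A minimal sketch of the pattern being described (the base-class shape here is an assumption based on this thread, not the actual multimodal.py code): the shared base class awaits the sync task unconditionally, so a session with no real sync work must park a completed no-op task.

```python
import asyncio

class RealtimeSessionBase:
    # Assumed shape of the shared base class (multimodal.py in this
    # discussion): it awaits _init_sync_task unconditionally, so every
    # subclass must provide one.
    _init_sync_task: asyncio.Task

    async def aclose(self) -> None:
        await self._init_sync_task

class GeminiRealtimeSession(RealtimeSessionBase):
    def __init__(self) -> None:
        # Created inside the running session loop. No truncation support
        # here, so park an already-trivial task; the alternative discussed
        # is guarding the await with capabilities.support_truncate in the
        # base class instead.
        self._init_sync_task = asyncio.create_task(asyncio.sleep(0))
```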

> Where do we make sure the transcribed user speech is inside the chat_ctx and always before the generated agent speech?

I don't think we need it, as the transcriber and LLM are independent of each other.

theomonnom (Member) commented Jan 14, 2025:

> Yes, it is being called directly from the base class. We need to keep the dummy task unless we wrap it with capabilities.support_truncate in the base class.

I'm not sure I follow; the base class is utils.EventEmitter[EventTypes].

> I don't think we need it, as the transcriber and LLM are independent of each other.

How do we get the user messages inside the ChatContext?

jayeshp19 (Collaborator, Author):

> I'm not sure I follow; the base class is utils.EventEmitter[EventTypes].

I mean multimodal.py

> How do we get the user messages inside the ChatContext?

From here, when audio transcription is done: https://github.com/livekit/agents/pull/1321/files#diff-4b3e6842c9b1bf3130541b6b2fd18dcc7d1b0051285496eca0355e62938d13fbR351
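The handler at that diff location presumably looks roughly like this (a sketch with assumed names; the event payload shape and the ChatContext.append signature are based on livekit-agents 0.x and may differ from the merged code):

```python
def _on_input_speech_done(self, ev) -> None:
    # When the transcriber finishes a segment of user speech, copy the
    # final transcript into the shared ChatContext as a user message.
    if ev.transcript:
        self._chat_ctx.append(role="user", text=ev.transcript)
```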

theomonnom (Member) commented Jan 14, 2025:

> How do we get the user messages inside the ChatContext?
>
> From here, when audio transcription is done: #1321 (files)

What I mean here is that with some bad timing, or if the VAD events come in differently, the data inside the chat context will not be "stable".

E.g.:

- You could have multiple user messages for only one assistant message
- The user messages could be appended after the assistant message (the order is wrong)
- etc.

jayeshp19 (Collaborator, Author):

> Where do we make sure the transcribed user speech is inside the chat_ctx and always before the generated agent speech?

> What I mean here is that with some bad timing, or if the VAD events come in differently, the data inside the chat context will not be "stable".
>
> E.g.:
>
> - You could have multiple user messages for only one assistant message
> - The user messages could be appended after the assistant message (the order is wrong)
> - etc.

User audio is usually processed in real-time, and we receive transcriptions quickly. However, you're right that these scenarios can occur.
Do you have any suggestions on how we can ensure the chat context remains stable and maintains the correct sequence?

theomonnom (Member) commented Jan 19, 2025:

> User audio is usually processed in real-time, and we receive transcriptions quickly. However, you're right that these scenarios can occur. Do you have any suggestions on how we can ensure the chat context remains stable and maintains the correct sequence?

Maybe we can just ignore them for now and not add them to the ChatContext (even if that's not ideal :/).
Otherwise we have to implement a sync mechanism, making sure we map new generations 1-1.
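A hedged sketch of what such a 1-1 sync mechanism could look like (all names hypothetical): buffer user transcriptions per generation and commit them just before the matching assistant message, so user speech always lands in the ChatContext first and in order.

```python
from collections import defaultdict

class TurnSync:
    """Hypothetical helper that keeps ChatContext ordering stable."""

    def __init__(self, chat_ctx) -> None:
        self._chat_ctx = chat_ctx
        # Transcripts waiting for their matching agent generation.
        self._pending_user: dict[int, list[str]] = defaultdict(list)

    def on_user_transcript(self, generation_id: int, text: str) -> None:
        # Hold user speech until the matching generation completes.
        self._pending_user[generation_id].append(text)

    def on_agent_generation_done(self, generation_id: int, agent_text: str) -> None:
        # Flush buffered user messages first so they always precede the
        # assistant message they triggered.
        for text in self._pending_user.pop(generation_id, []):
            self._chat_ctx.append(role="user", text=text)
        self._chat_ctx.append(role="assistant", text=agent_text)
```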

theomonnom marked this pull request as ready for review on January 20, 2025 09:14
jayeshp19 changed the title from "[draft] Support STT with Google realtime API" to "Support STT with Google realtime API" on Jan 20, 2025
jayeshp19 changed the title from "Support STT with Google realtime API" to "Support Transcriber session with Google realtime API" on Jan 20, 2025
jayeshp19 merged commit 9994f90 into main on Jan 20, 2025
13 of 14 checks passed
jayeshp19 deleted the gemini-realtime-stt branch on January 20, 2025 15:01
github-actions (bot) mentioned this pull request on Jan 20, 2025
jayeshp19 restored the gemini-realtime-stt branch on January 21, 2025 06:55